```mermaid
flowchart LR
  A([Type of splitting]) -.-> B([Partition])
  A -.-> C([Hierarchical])
  C -.-> D([Agglomerative])
  C -.-> E([Divisive])
```
All clustering algorithms require a measure of dissimilarity.
We will mainly discuss distances between quantitative variables, but distances can also be computed for categorical variables.
\[distance(A,B) = \sqrt{\sum_{i=1}^{n}(a_i - b_i)^2}\]
\[distance(A,B) = \sum|a_i - b_i|\]
\[distance(A,B) = \left(\sum|a_i - b_i|^{p}\right)^{1/p}\]
where Euclidean (p=2) and Manhattan (p=1) are the most common cases.
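The Minkowski family above can be sketched in a few lines. The notes use R, but the following is a minimal Python illustration; the function name `minkowski` is our own choice, not from the source.

```python
def minkowski(a, b, p):
    """Minkowski distance between two equal-length vectors.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    """
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = [0, 0], [3, 4]
print(minkowski(a, b, 1))  # Manhattan: |3-0| + |4-0| = 7.0
print(minkowski(a, b, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```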
\[r = \frac{\sum(a_i-\bar a)(b_i-\bar b)}{\sqrt{\sum(a_i-\bar a)^2\sum(b_i-\bar b)^2}}\]
The correlation coefficient is a similarity measure, but can be transformed to a distance.
\[distance(A,B) = \sqrt{1-r}\]
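The correlation-to-distance transform can be sketched as follows (Python illustration; the helper name `corr_distance` is ours). Note that perfectly correlated profiles get distance 0 even when their absolute values differ.

```python
import math

def corr_distance(a, b):
    """Pearson correlation r between two vectors, converted to sqrt(1 - r)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return math.sqrt(1 - num / den)

# perfectly correlated profiles (r = 1) have distance 0
print(corr_distance([1, 2, 3], [10, 20, 30]))  # 0.0
```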
The centroid of cluster \(k\) is the mean of the \(n_k\) objects assigned to it:
\[{\mathbf m}_k = \frac{1}{n_k} \sum_{i \in C_k} {\mathbf x}_{i}\]
In R, this algorithm is available through the `kmeans` function.
1. Initialization. Select \(k\) initial centroids. The initial centroids can be selected in several different ways, for example by choosing \(k\) of the objects at random.
2. Assign each object to the closest centroid (in terms of squared Euclidean distance). The squared Euclidean distance between an object (a data point) and a cluster centroid \(m_k\) is \(d_i = \sum_{j=1}^p (x_{ij} - m_{kj})^2\). By assigning each object to the closest centroid, the total within-cluster sum of squares (WSS) is minimized. \[WSS = \sum_{k=1}^K\sum_{i \in C_k}\sum_{j=1}^p (x_{ij} - m_{kj})^2\]
3. Update the centroids for each of the \(k\) clusters by computing the centroid for all the objects in each of the clusters.
4. Repeat steps 2 and 3 until convergence.
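The steps above can be sketched as a minimal Lloyd's algorithm. This is a Python illustration, not the `kmeans` implementation in R; the random-objects initialization is one common strategy, and all names here are our own.

```python
import random

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means sketch. X: list of points; returns (centroids, labels)."""
    rng = random.Random(seed)
    # step 1: initialize with k objects chosen at random
    centroids = [list(p) for p in rng.sample(X, k)]
    for _ in range(iters):
        # step 2: assign each object to the closest centroid (squared Euclidean)
        labels = [min(range(k),
                      key=lambda j: sum((x - m) ** 2 for x, m in zip(p, centroids[j])))
                  for p in X]
        # step 3: recompute each centroid as the mean of its cluster
        new = []
        for j in range(k):
            members = [p for p, l in zip(X, labels) if l == j]
            new.append([sum(c) / len(members) for c in zip(*members)]
                       if members else centroids[j])
        if new == centroids:  # step 4: stop when centroids no longer change
            break
        centroids = new
    return centroids, labels

X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, labels = kmeans(X, 2)
print(labels)  # the two obvious groups get two distinct labels
```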
Linkage method - how the dissimilarity between two clusters is defined, e.g. single (minimum), complete (maximum), or average pairwise distance between objects of the two clusters.
To find concrete clusters based on hierarchical clustering we can cut the dendrogram either at a chosen height or into a chosen number of clusters.
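Both ways of cutting can be illustrated with SciPy (a Python sketch, not the R workflow used in the notes; assumes `scipy` and `numpy` are installed). `fcluster` with `criterion="maxclust"` cuts into a given number of clusters, while `criterion="distance"` cuts at a given height.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
Z = linkage(X, method="average")  # agglomerative clustering, average linkage

by_k = fcluster(Z, t=2, criterion="maxclust")    # cut into 2 clusters
by_h = fcluster(Z, t=5.0, criterion="distance")  # cut the tree at height 5
print(by_k)  # two distinct labels for the two obvious groups
```

Here both cuts recover the same two groups, since all within-group merges happen far below height 5.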
The gap statistic compares the observed \(WSS\) with its expectation under a reference (null) model, e.g. data drawn uniformly over the range of the observed data.
\(G(k) = E[\ln(WSS_{unif}(k))] - \ln(WSS(k))\)
Choose \(k\) as the smallest \(k\) such that \(G(k) \geq G(k+1) - s_{k+1}\).
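A rough sketch of the gap computation, in Python for illustration (the notes use R). It uses `scipy.cluster.vq.kmeans2` for the k-means fits; the helper names `wss` and `gap` are our own, and the expectation is estimated by averaging over `B` uniform reference datasets.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def wss(X, k, seed=0):
    """Total within-cluster sum of squares for a k-means fit."""
    centroids, labels = kmeans2(X, k, minit="++", seed=seed)
    return sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))

def gap(X, k, B=20, seed=0):
    """G(k) = E[ln(WSS_unif(k))] - ln(WSS(k)), E estimated over B uniform draws."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [np.log(wss(rng.uniform(lo, hi, size=X.shape), k, seed)) for _ in range(B)]
    return np.mean(ref) - np.log(wss(X, k, seed))

# two tight, well-separated blobs: clustered data has much smaller WSS
# than uniform data, so the gap at k = 2 is clearly positive
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
print(gap(X, 2))
```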
The silhouette value for a single object \(i\) is a value between -1 and 1 that measures how similar the object is to the other objects in its own cluster compared with how similar it is to objects in other clusters.
The average silhouette over all objects is a measure of how good the clustering is: the higher the value, the better the clustering.
For an object \(i\) in cluster \(C_a\), let \(a(i)\) be the mean distance from \(i\) to all other objects in \(C_a\), and \(b(i)\) the smallest mean distance from \(i\) to the objects of any other cluster. Then
\[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\]
\(s(i) \approx 1\): well clustered; \(s(i) \approx 0\): lies between two clusters; \(s(i) \approx -1\): probably assigned to the wrong cluster.
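The silhouette of a single object can be computed directly from a distance matrix (Python sketch; the function name `silhouette` is our own, and we assume every cluster contains at least two objects).

```python
def silhouette(i, labels, dist):
    """Silhouette s(i) = (b - a) / max(a, b) for object i.

    a: mean distance to the other objects in i's own cluster;
    b: smallest mean distance to the objects of any other cluster;
    dist: precomputed distance matrix (list of lists).
    """
    own = labels[i]
    def mean_dist(cluster):
        d = [dist[i][j] for j, l in enumerate(labels) if l == cluster and j != i]
        return sum(d) / len(d)
    a = mean_dist(own)
    b = min(mean_dist(c) for c in set(labels) if c != own)
    return (b - a) / max(a, b)

pts = [0, 1, 10, 11]                 # two clear 1-D clusters
labels = [0, 0, 1, 1]
dist = [[abs(x - y) for y in pts] for x in pts]
print(silhouette(0, labels, dist))   # close to 1: well clustered
```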
The pvclust R package assesses the uncertainty in hierarchical cluster analysis via multiscale bootstrap resampling (Suzuki and Shimodaira 2006). The NbClust R package computes a collection of indices for choosing the number of clusters.